A Bootstrap Test for Comparing Performance
Authors: Cohen and Kim
Abstract
Experimental trials of programs are sometimes aborted when resource bounds are exceeded. The data from these trials are called censored data. This paper discusses the inferences that can be drawn from samples that include censored data. A key component of statistical inference, the sampling distribution, is generally not known for censored samples. However, the bootstrap procedure has been applied to estimate sampling distributions of many statistics empirically. We show how to use the bootstrap to estimate the sampling distribution of the difference of means of two censored samples, enabling many comparisons that were previously ad hoc, such as the comparison of run times of algorithms when some run times exceed a limit. The reader will see how to extend the bootstrap to other tests with censored data. We also describe a test due to Etzioni and Etzioni for the difference of two censored samples. We show that the bootstrap test is more powerful, primarily because it does not make a strong guarantee that is a feature of Etzioni and Etzioni's test.

Cohen and Kim. Bootstrap tests for comparing two samples with censored data.

The Problem of Censored Data

The subject of this paper is how to measure and make inferences about the performance of a program when trials of the program are occasionally aborted. This happens when resource bounds are exceeded; for example, when a program runs out of time or space before solving a problem. Imagine running ten trials of a search algorithm, recording the number of node expansions required to find a goal node if that number is less than 5000, and abandoning the trial otherwise. A hypothetical sample of the number of node expansions, n, is:

Trial  1    2    3    4    5    6     7     8    9    10
Nodes  287  610  545  400  123  5000  5000  601  483  250

Table 1. A sample that includes two censored data.
Two of the trials were abandoned, and the numbers we record in these cases (5000) are called censored data. Censored data present no problems for descriptive statements about the sample, but they make it difficult to draw more general inferences. Provided we limit ourselves to the sample, we can say, for example, that the mean number of nodes expanded in the previous ten trials is n̄ = (Σ_{i=1..10} n_i)/10 = 1329.9. If we are disinclined to include the censored data in the average¹, then we can leave them out and simply report the mean number of nodes expanded after the censored data are discarded: n̄ = (Σ_{i≠6,7} n_i)/8 = 412.375.

We run into problems, however, when we attempt to generalize sample results. For example, it is unclear how to infer the "population" mean number of nodes that would be expanded by the previous algorithm if we ran other experiments with ten trials. Statistical theory tells us how to make this generalization if no data are censored: the best estimate of the population mean is the sample mean. But our sample includes censored data, and we should not infer that the population mean is 1329.9, because we do not know how many nodes the censored trials might have expanded had we let them run to completion. Nor should we infer that the population mean of uncensored trials is 412.375, because statistical theory does not explain the relationship between the mean of a sample that includes censored data and the mean of a population. We can draw no conclusions that depend on inferring the population mean; for example, we risk biased results if we try to infer that one algorithm expands significantly fewer nodes than another [7].

This paper describes a general method for drawing inferences from samples that include censored data. The method is an application of bootstrap resampling, a Monte Carlo technique for estimating sampling distributions of statistics [2,6].
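Both sample means can be reproduced directly from the Table 1 data; the following is a minimal sketch (the variable names are ours, not the paper's):

```python
# Node counts from Table 1; trials 6 and 7 were censored at 5000 nodes.
nodes = [287, 610, 545, 400, 123, 5000, 5000, 601, 483, 250]
CENSOR_LIMIT = 5000

# Mean of all ten trials, censored values included.
mean_all = sum(nodes) / len(nodes)

# Mean after discarding the censored values.
uncensored = [x for x in nodes if x < CENSOR_LIMIT]
mean_uncensored = sum(uncensored) / len(uncensored)

print(mean_all)         # 1329.9
print(mean_uncensored)  # 412.375
```

As the discussion above makes clear, neither number is by itself a trustworthy estimate of the population mean.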
We present two tests: one to tell us whether the mean of a sample is significantly different from a particular value, the other to determine whether two samples are significantly different; the reader will easily see how to construct other tests, including tests that depend on statistics other than the mean. We compare our two-sample test to one designed by Etzioni and Etzioni [5], and we show empirical power curves from which we conclude that our test is more powerful in many conditions. Bootstrap resampling is well known, and our one-sample test is similar in some respects to Efron's discussion of the sampling distribution of the trimmed mean [4]. The contributions of the paper are the two-sample test, the comparisons with Etzioni and Etzioni's test, and bringing the constituent techniques to the attention of the AI community.

¹ In this example, the abandoned trials expanded more than ten times as many nodes as the others, which suggests that they are somehow different, not really comparable with the others, and should be left out of the sample.

Background: Sampling Distributions

Statistical tests are commonly tests of whether sample results are unusual. Imagine we have two search algorithms, A and B, and a sample of ten trials for each algorithm. We want to know whether A expands significantly more nodes than B. A common way to answer the question is to compute the difference of the sample mean number of nodes expanded by A, n̄_A, and the same statistic for B's sample, n̄_B, and ask whether n̄_A − n̄_B is unusually large or small given the null hypothesis, H₀: μ_A = μ_B, that the population means of the number of nodes expanded by A and B are equal. If the sample result n̄_A − n̄_B is unusual, we reject the null hypothesis; we say μ_A is probably not equal to μ_B, or that algorithm A expands a significantly different number of nodes, on average, than algorithm B.
To say that a sample result is unusual, we must know the sampling distribution of the result: the probability distribution of all possible sample results, calculated from samples of a fixed size, given the null hypothesis H₀. For example, the sampling distribution of n̄_A − n̄_B, given H₀: μ_A = μ_B, is the probability distribution of all possible values of n̄_A − n̄_B that might be obtained by drawing samples of a fixed size from two populations with equal means. You can imagine what the sampling distribution looks like: small differences are likely (because H₀ says the population means are equal) and large positive and negative differences are unlikely. The sampling distribution of n̄_A − n̄_B looks like a bell curve, although it is not Gaussian; rather, the sampling distribution of the difference of two means is a t distribution. To see whether a sample result is unusual, one simply converts it to a t statistic and sees where the statistic falls in the t distribution. If the t statistic falls in one of the tails of the distribution (as shown in Figure 1), then we know that the corresponding sample result has a relatively low probability, and we reject the null hypothesis.

Statisticians showed long ago that the t distribution is the sampling distribution of the difference of two means under the null hypothesis that the population means are equal, but no comparable results tell us the sampling distribution if the samples contain censored data. Moreover, for reasons discussed in [2,7], if we ignore the censored data, we get biased sampling distributions. Thus we cannot tell whether sample results are unusual, nor can we test hypotheses; at least, not by conventional means. The bootstrap resampling technique provides a way to estimate the sampling distribution of any statistic, given only the sample.
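As a concrete illustration of estimating a sampling distribution from the samples alone, here is a minimal bootstrap sketch for the difference of two means under H₀. The data are invented for illustration, and pooling the two samples before resampling is one common way to impose the null hypothesis that the two populations are identical; the paper's own procedure may differ in detail.

```python
import random
import statistics

random.seed(0)

# Hypothetical node counts for algorithms A and B (illustrative data only).
a = [287, 610, 545, 400, 123, 601, 483, 250, 330, 415]
b = [190, 220, 305, 150, 270, 210, 330, 260, 180, 240]

observed = statistics.mean(a) - statistics.mean(b)

# Under H0 the two populations are identical, so resample both
# pseudosamples, with replacement, from the pooled data.
pool = a + b
B = 10_000
diffs = []
for _ in range(B):
    a_star = random.choices(pool, k=len(a))
    b_star = random.choices(pool, k=len(b))
    diffs.append(statistics.mean(a_star) - statistics.mean(b_star))

# Two-tailed p value: the fraction of resampled differences at least
# as extreme as the observed difference of means.
p = sum(abs(d) >= abs(observed) for d in diffs) / B
print(observed, p)
```

The list `diffs` is the empirical estimate of the sampling distribution of n̄_A − n̄_B under H₀, standing in for the t distribution when no theoretical result is available.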
In particular, bootstrapping permits us to estimate the sampling distribution of unusual statistics such as "the mean of all the sample values less than 5000," and thus the sampling distributions of statistics from samples with censored data.

[Figure 1. The sampling distribution of the difference of two means. Step 1: convert the sample result to a t statistic; in this case, t = 2.4. Step 2: find the area of the t distribution above 2.4 (the shaded tail, α = .025); this is the probability of attaining the sample result by chance if H₀ is true, the probability of incorrectly rejecting H₀.]
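For instance, the sampling distribution of the censored statistic "mean of the sample values less than 5000" can be estimated by resampling Table 1 itself. A minimal sketch, assuming the Table 1 data and a seeded stdlib random number generator:

```python
import random

random.seed(0)

# Node counts from Table 1; trials 6 and 7 were censored at 5000 nodes.
nodes = [287, 610, 545, 400, 123, 5000, 5000, 601, 483, 250]
CENSOR_LIMIT = 5000

def censored_mean(sample):
    """Mean of the uncensored values in a sample (None if all are censored)."""
    kept = [x for x in sample if x < CENSOR_LIMIT]
    return sum(kept) / len(kept) if kept else None

# Draw B bootstrap resamples (with replacement) and compute the statistic
# on each; the resulting values approximate its sampling distribution.
B = 10_000
boot = []
for _ in range(B):
    resample = random.choices(nodes, k=len(nodes))
    m = censored_mean(resample)
    if m is not None:
        boot.append(m)

boot.sort()
# A 95% percentile interval for the censored-mean statistic.
lo, hi = boot[int(0.025 * len(boot))], boot[int(0.975 * len(boot))]
print(lo, hi)
```

Note that the statistic is computed on each resample exactly as it was computed on the original sample, so the censoring rule is carried through the resampling rather than ignored.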
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
Publication date: 1993